4 Main Analysis (20 pts)
4.1 Data Cleaning Process
# install.packages('geosphere')
library(tidyverse)
df <-read_csv("taxi_sample_2016.csv", col_types = cols(X1 = col_skip()))
df %>% rename(pickup_datetime = tpep_pickup_datetime,
dropoff_datetime = tpep_dropoff_datetime) -> df
# data cleaning
df %>% mutate (
Duration = (dropoff_datetime - pickup_datetime) / 60,
Average_speed = trip_distance / as.numeric(Duration / 60),
Fare_rate = fare_amount / trip_distance,
Euclidean_distance = 0.000621371192 *
geosphere::distVincentyEllipsoid(p1 = df[, c('pickup_longitude', 'pickup_latitude')],
p2 = df[, c('dropoff_longitude', 'dropoff_latitude')]),
Winding_factor = trip_distance / Euclidean_distance) %>%
filter(
between(pickup_longitude, -74.052, -73.7443),
between(pickup_latitude, 40.57, 40.877),
between(trip_distance, 0, 30),
between(fare_amount, 1, 105),
between(Fare_rate, 1, 11),
between(passenger_count, 0, 6),
between(Duration, 2, 120),
between(Euclidean_distance, 0, 15),
between(Winding_factor, 0.95, 6),
between(Average_speed, 3.1, 55),
between(total_amount, 0, 300),
improvement_surcharge >= 0,
tolls_amount <= 300,
tip_amount <= 100,
mta_tax >= 0,
extra >= 0,
dropoff_longitude < 0,
passenger_count > 0,
RatecodeID <= 6
) -> clean_df
error_rate <- 1 - dim(clean_df)[1] / dim(df)[1]
print(paste("Filtered erroneous data rate:", as.character(error_rate)))## [1] "Filtered erroneous data rate: 0.08175"
By looking at the univariate and joint distributions of variables, we found a few outliers. These outliers contains unreasonable values, including trips that fall in oceans, pickup locations far away from New York City, taxis that traveled negative distances, unreasonably expensive fares, trips with total distance shorter than the Euclidean distance from origin to destination and so on. These records are identified by looking at the visualised distributions, filtered out according to the selected range listed in table 2, and then marked as erroneous data. After applying the filter rules in table 2, we found around 13.4% erroneous data records.
| Field Name | Range | Description |
|---|---|---|
| pickup_longitude | [-74.052,-73.744] | Taxi pickup locations filtered within NYC |
| pickup_latitude | [40.57,40.88] | Taxi pickup locations filtered within NYC |
| trip_distance | [0,30] miles | filtered within a reasonable range of taxi trip distance |
| fare_amount | [1,105] USD | filtered within a reasonable range of expense for a taxi trip |
| Fare_rate* | [1,11] USD/mile | Fare_Rate = Fare_amount / Trip_distance |
| passenger_count | (0,6] | a taxi is not likely to carry more than 6 people |
| Duration* | [2,120] min | Taking a taxi for less than 2 minutes or longer than 2 hours are less common |
| Euclidean_distance* | [0,15] miles | The derived straight line distance from origin to destination. Calculated using the Vincenty formula |
| Winding_factor* | [0.95,6] | Winding_factor= Trip_distance / Euclidean_distance. If winding factor <1, it violates Euclidean distance limit. If it is too large, the track is not reasonable. |
| Average_speed* | [3.1,55] mph | Average_speed= trip_distance/duration. Set from walking speed (3.1) to speed limit (55) |
| total_amount | [0,300] | Spending more than $300 for taking a taxi is less common |
| improvement_surcharge | >=0 | Constraint to positive |
| tolls_amount | <=300 | more than $300 tolls is less common |
| tip_amount | <=100 | Giving more than $100 for a taxi trip is less common |
| mta_tax | >0 | a small number of erroneous records are found to have negative mta_tax |
| extra | >0 | |
| dropoff_longitude | <0 | |
| RatecodeID <= 6 | there are a few erroneous records with RatecodeID > 6, the max is 6 |
Note: * are derived variables.
4.2 Cleaned Data Distribution Visual
After cleaning the data and adding the derived features, we repeat the distribution visualization process to check the thorough distribution. We still plot the graphs by categorical variables and descrete variables.
Categorical: passenger_count, RatecodeID, store_and_fwd_flag, payment_type, extra, mta_tax, improvement_surcharge
Discrete: trip_distance, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, fare_amount, tip_amount, tolls_amount, total_amount, Fare_rate, Duration, Euclidean_distance, Winding_factor, Average_speed
bar_1 <- ggplot(clean_df, aes(VendorID))+geom_bar(fill="steelblue", width=0.5)
bar_2 <- ggplot(clean_df, aes(passenger_count))+geom_bar(fill="steelblue", width=0.5)
bar_3 <- ggplot(clean_df, aes(RatecodeID))+geom_bar(fill="steelblue", width=0.5)
bar_4 <- ggplot(clean_df, aes(store_and_fwd_flag))+geom_bar(fill="steelblue", width=0.5)
bar_5 <- ggplot(clean_df, aes(payment_type))+geom_bar(fill="steelblue", width=0.5)
bar_6 <- ggplot(clean_df, aes(extra))+geom_bar(fill="steelblue")
bar_7 <- ggplot(clean_df, aes(mta_tax))+geom_bar(fill="steelblue", width=0.1)
bar_8 <- ggplot(clean_df, aes(improvement_surcharge))+geom_bar(fill="steelblue", width=0.01)
ggarrange(bar_1,bar_2,bar_3, bar_4, bar_5, bar_6, bar_7, bar_8,
ncol = 2, nrow = 4)After the data cleaning, all categorical data appear to already exclude the outlier, especially for the improvement_surcharge, all data is 0.3; though the data has some imbalanced feature, it shouldn’t be too problematic.
hist_1 <- ggplot(clean_df, aes(trip_distance))+ geom_histogram(binwidth=2, fill="steelblue") + labs(y = "Frequency")
hist_2 <- ggplot(clean_df, aes(pickup_longitude))+ geom_histogram(binwidth=0.01,fill="steelblue")+ labs(y = "Frequency")
hist_3 <- ggplot(clean_df, aes(pickup_latitude))+ geom_histogram(binwidth=0.01,fill="steelblue")+ labs(y = "Frequency")
hist_4 <- ggplot(clean_df, aes(dropoff_longitude))+ geom_histogram(binwidth=0.01,fill="steelblue")+ labs(y = "Frequency")
hist_5 <- ggplot(clean_df, aes(dropoff_latitude))+ geom_histogram(binwidth=0.01,fill="steelblue")+ labs(y = "Frequency")
hist_6 <- ggplot(clean_df, aes(fare_amount))+ geom_histogram(binwidth=10,fill="steelblue")+ labs(y = "Frequency")
hist_7 <- ggplot(clean_df, aes(tip_amount))+ geom_histogram(binwidth=10,fill="steelblue")+ labs(y = "Frequency")
hist_8 <- ggplot(clean_df, aes(tolls_amount))+ geom_histogram(binwidth=10,fill="steelblue")+ labs(y = "Frequency")
hist_9 <- ggplot(clean_df, aes(total_amount))+ geom_histogram(binwidth=10,fill="steelblue")+ labs(y = "Frequency")
ggarrange(hist_1,hist_2,hist_3, hist_4, hist_5, hist_6, hist_7, hist_8, hist_9,
ncol = 3, nrow = 3)Aftter the cleaning, the discrete data also looks much nicer, among them pick_up longitude and latitude seems distribute more cloe than the drop-off ones, this makes sense because for taxis pick up guests in NYC, they do have some chances to drive far outside the city. For total_amount, some price over 100 make the distribution not that pretty, but still acceptable in reality. In the next step, we will move on take a look of the derived important features, those features should play a very important role in the later analysis, combined with our pick_up longotude, latitude and trip_distance. Let’s first take a look of the boxplot.
par(mar=c(10,4,4,4))
important <- c('trip_distance','Duration','Average_speed','Fare_rate','pickup_latitude','pickup_longitude','Winding_factor','Euclidean_distance')
important_df <- clean_df[,important]
boxplot(important_df, las=2)The boxplot is not that much informative, since they have so different range and it’s meaningless to standardize those features. So histogram can be a better choice in this case.
The graph below shows our picked features, with a more careful picked binwidth.
hist_1 <- ggplot(clean_df, aes(x=trip_distance))+ geom_histogram(binwidth=0.2, fill="steelblue") + labs(x = 'Trip distance (miles)', y = "Frequency")
hist_2 <- ggplot(clean_df, aes(Duration))+ geom_histogram(binwidth=0.5, fill="steelblue")+ labs(x = 'Trip duration (minutes)', y = "Frequency")
hist_3 <- ggplot(clean_df, aes(Average_speed))+ geom_histogram(binwidth=0.2, fill="steelblue")+ labs(x = 'Average speed (mph)', y = "Frequency")
hist_4 <- ggplot(clean_df, aes(Fare_rate))+ geom_histogram(binwidth=0.05, fill="steelblue")+ labs(x = 'Fare rate (USD/mile)', y = "Frequency")
hist_5 <- ggplot(clean_df, aes(pickup_latitude))+ geom_histogram(binwidth=0.002, fill="steelblue")+ labs(x = 'Latitude', y = "Frequency")
hist_6 <- ggplot(clean_df, aes(pickup_longitude))+ geom_histogram(binwidth=0.002, fill="steelblue")+ labs(x = 'Longitude', y = "Frequency")
hist_7 <- ggplot(clean_df, aes(Winding_factor))+ geom_histogram(binwidth=0.02, fill="steelblue")+ labs(x = 'Winding Factor', y = "Frequency")
hist_8 <- ggplot(clean_df, aes(Euclidean_distance))+ geom_histogram(binwidth=0.1, fill="steelblue")+ labs(x = 'Euclidean distance (miles)', y = "Frequency")
ggarrange(hist_1,hist_2,hist_3, hist_4, hist_5, hist_6, hist_7, hist_8, ncol = 2, nrow = 4)At this step, we filtered, cleaned the data, derived some interesting features, visualized all the features independently and picked the most important features for further analysis. In next subsection, we will explore the travel patterns, both in the temporal dimension and spatial dimension.
4.3 Explore the Temporal-Spatial Taxi Travel Patterns in NYC
library('lubridate')
library(dplyr)
clean_df<-na.omit(clean_df)
clean_df$day<-strptime(clean_df$pickup_datetime, "%Y-%m-%d")
clean_df$weekday<-weekdays(clean_df$day)
pick_up=strptime(clean_df$pickup_datetime,"%Y-%m-%d %H:%M:%OS")
drop_off=strptime(clean_df$dropoff_datetime,"%Y-%m-%d %H:%M:%OS")
clean_df$duration=round(drop_off-pick_up,digits=0)
clean_df$hour=hour(clean_df$pickup_datetime)clean_df$day<-as.character(clean_df$day)
clean_df %>%mutate(`day of week`= factor(weekday,
c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'))
) -> clean_df
library(ggthemes)
g_1 <- ggplot(data=clean_df,
aes(x=hour, y=Duration,color=`day of week`)) +
stat_summary(fun.y = mean,geom="point",aes(shape=`day of week`))+
stat_summary(fun.y = mean,geom="line",size=0.3)+
ggtitle("(a) Average trip duration in 24 hours")+
ylab("Average Duration (min)")+xlab("Hour of Day")+
scale_color_brewer(type ='qual' )+
scale_x_continuous(breaks = c(0:23) )
g_2 <- ggplot(data=clean_df,
aes(x=hour, y=Euclidean_distance,color=`day of week`)) +
stat_summary(fun.y = mean,geom="point",aes(shape=`day of week`))+
stat_summary(fun.y = mean,geom="line",size=0.3)+
ggtitle("(b) Average trip distance in 24 hours")+
ylab("Average trip distance (miles)")+xlab("Hour of Day")+
scale_color_brewer(type ='qual' )+
scale_x_continuous(breaks = c(0:23))
g_3 <- ggplot(data=clean_df,
aes(x=hour, y=Average_speed,color=`day of week`)) +
stat_summary(fun.y = mean,geom="point",aes(shape=`day of week`))+
stat_summary(fun.y = mean,geom="line",size=0.3)+
ggtitle("(c) Average travel speed in 24 hours")+
ylab("Average speed (mph)")+xlab("Hour of Day")+
scale_color_brewer(type ='qual' )+
scale_x_continuous(breaks = c(0:23))
clean_df %>%
group_by(day,hour)%>%
summarise(sum_passenger_count=sum(passenger_count) ) %>%
ungroup() -> temp4
temp4$day<-strptime(temp4$day, "%Y-%m-%d")
temp4$day_of_week<-weekdays(temp4$day)
temp4$day<-as.character(temp4$day)
temp4 %>%
mutate(`day of week`=factor(day_of_week,
c('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'),
ordered = TRUE),
workday = if_else(`day of week`<'Saturday','weekdays','weekends'),
workday = factor(workday)
) -> temp4
g_4 <-temp4%>%
group_by(`day of week`,hour) %>%
summarise(avg_passenger_pickups=mean(sum_passenger_count)) %>%
ggplot(aes(x=hour,y=avg_passenger_pickups,color=`day of week`))+
geom_line(size=0.3)+
geom_point(aes(shape=`day of week`),size=1)+
ggtitle("(d) Hourly pikcked up person counts")+
ylab("pickups (person per hour)")+xlab("Hour of Day")+
scale_color_brewer(type ='qual' )+
scale_x_continuous(breaks = c(0:23))
g_5 <-temp4%>%
group_by(workday,hour) %>%
summarise(avg_passenger_pickups=mean(sum_passenger_count)) %>%
ggplot(aes(x=hour,y=avg_passenger_pickups,color=workday))+
geom_line(size=0.8)+
geom_point(aes(shape=workday),size=2)+
ggtitle("(e) Hourly pikcked up person counts by weekdays and weekends")+
ylab("pickups (person per hour)")+xlab("Hour of Day")+
scale_color_discrete( )+
scale_x_continuous(breaks = c(0:23))
ggarrange(g_1,g_2,g_3,g_4,g_5, nrow = 5)Color Vision Deficiency
library("ggthemes")
g_5 <- g_1 + scale_colour_colorblind()
g_6 <- g_2 + scale_colour_colorblind()
g_7 <- g_3 + scale_colour_colorblind()
g_8 <- g_4 + scale_colour_colorblind()
ggarrange(g_5,g_6,g_7,g_8, ncol = 2, nrow = 2,
common.legend = TRUE,legend = "bottom")After exploring all the data, the results for average duration, trip distance, trip speed, and number of person trips aggregated by day of week are visualized.The first four graph are normal, the last four graphs are color vision defiency.
In Figure (a), during a week, there is a peak of average duration around at 15pm while a trough at 6am. Also, on weekends, even on Monday, people would like not to travel with a long duration during the whole day. As for Figure (b), it seems that there is a common trend for every day of a week. Besides, the obvious peak appears around 4am and after that, the average trip distance decline sharply to a quite low value and keep stable until next day’s 4am comes again. In Figure (c), each weekday also has a relatively same trend while weekends come with a little difference. Beside same peak and trough as Figure (b), what more interesting is that, the hourly average travel speed has a quite similar pattern with hourly average trip distance, but with a slightly large variance among each day of week. For Figure (d), it looks like that, hourly pickups has an opposite trend compared with average travel speed and average trip distance. Where hourly pickups has a peak, other two plots have a trough and vice versa. Another thing, that might be interesting is, weekdays seems have a similar trend as weekends, but with a time lag of approximately 3 hours. To make this point clearer, statistics aggregated by weekdays and weekends are shown in Figure (e).
library(viridis)
ggplot(clean_df, aes(x = pickup_longitude, y = pickup_latitude)) +
geom_point(alpha = 0.1,size = 0.1) +
ggtitle("Pick Up Location")+
geom_hex(bins=80)+
scale_fill_distiller(palette = 'Spectral')+
xlab('')+ylab('')Form heatmap, point graph,and 2d Kernel density graph, we can easy find that most taixes pick up on John F. Kennedy International Airport, LaGuardia Airport (LGA) and Mahatten expect up town. Also the highes density for taxies picking up passanger is at Midtown in Mahattan.